The aim of this project is to investigate the relationships and trends between the number of children per woman in a given country and the female employment rates in that country, for nearly 200 countries. We will look at female employment rates from several perspectives: we will compare three employment sectors, investigate two employment statuses, and compare the female employment rates for the employment statuses grouped as one collective against those grouped as a different collective. We will obtain the datasets and prepare them for exploration, then explore them and prepare them for more comprehensive analysis. We will end the report by highlighting any patterns and trends that we observe during our analysis, along with their interpretation and significance.
This report is on the analysis of datasets from Gapminder. We are going to use female employment rate data for the agriculture, industry, and service sectors. We will also use female employment rate data for family worker and self-employed worker employment statuses. The following datasets were downloaded:
We intended to also include a third employment status, salaried workers, in the investigation but the dataset remained unavailable by the time of completion of this project.
The acquired data will be used to address the following questions:
# Importing required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
%matplotlib inline
In this section, we will load our datasets, assess whether and where trimming and data cleaning are applicable, and then execute those steps as necessary. Finally, we will merge relevant segments of the disparate datasets into the final form(s) which we will use for analysis.
In this first subsection, we will look at the general properties of our dataset, and note where further data cleaning is required. We will also indicate and lay out the precise cleaning steps that we will later carry out.
# We load the first dataset, which shows the total fertility rate by country
df_fertility = pd.read_csv("children_per_woman_total_fertility.csv")
# We show the first few lines of the dataframe to check it loaded correctly
df_fertility.head()
We can see that the dataframe contains fertility rate data for a number of countries, starting from the year 1800, with projections up to 2100, which covers a period of 301 years. We will only be interested in data up to the latest full year, and so we will drop the columns from the current year (2022) going forward. Depending on further findings below, we may also need to drop columns earlier than a certain period, which is yet to be determined at this juncture in the report.
Next, we'll check how many rows are present in the dataframe.
df_fertility.shape
This tells us that there are 202 records, presumably for 202 countries, in our dataframe.
We inspect the dataframe in further detail.
df_fertility.info()
We find that the data type for 301 of the columns is float, and 1 column likely has the string data type. This is most likely the country column.
We verify this by looking at the data types below.
df_fertility.dtypes
# Checking the data type of the first element of the country column
type(df_fertility["country"][0])
The above confirms that the first column has the string data type.
We double check that all the intermediate columns that were not shown above have the float data type.
# Aggregated counts of the data types in the dataframe
df_fertility.dtypes.value_counts()
This is satisfactory. No data types will need to be changed for this dataframe.
We look at the summary statistics for the dataframe.
df_fertility.describe()
Some of the columns towards the end show the existence of missing values. We will address these later.
We first check whether there are any infeasibly extreme values in any of the columns, by looking at the highest maximum value, and the lowest minimum value across the columns.
# Checking the highest value in the max row of the descriptive statistics table
df_fertility.describe().loc["max"].max()
# Checking the lowest value in the min row of the descriptive statistics table
df_fertility.describe().loc["min"].min()
The lowest min value is not a negative number. The highest max value is not inordinately large. We conclude these are both reasonable values for the fertility rate.
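The same plausibility check can be made directly on the numeric columns, without going through `describe()`. The sketch below uses a small made-up dataframe (the `toy` name and its values are illustrative, not Gapminder data), and the upper bound of 10 children per woman is our own assumption about what counts as plausible.

```python
import pandas as pd

# Toy stand-in for the fertility dataframe; values are made up for illustration
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "1800": [6.1, 7.0, 5.5],
    "1801": [6.0, 6.9, 5.4],
})

# Keep only the numeric (year) columns
numeric = toy.select_dtypes(include="number")

# Overall minimum and maximum across every year column
overall_min = numeric.min().min()
overall_max = numeric.max().max()

# Assumed plausible bounds for children per woman (illustrative choice)
lower_bound, upper_bound = 0, 10
values_plausible = bool(overall_min >= lower_bound and overall_max <= upper_bound)
print(overall_min, overall_max, values_plausible)
```

Stating the bounds explicitly makes the criterion for "reasonable" visible, rather than leaving it to a visual reading of the summary table.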
We then turn to missing values.
# Checking the number of null values for each column
df_fertility.isnull().sum()
The columns towards the end show missing values as indicated earlier. We look at a summary count of the null values.
# Aggregated counts of the null values in the dataframe
df_fertility.isnull().sum().value_counts()
215 columns have no missing values. There are 87 columns missing 1 value each. We take a look at which columns these are.
# Mask extracting columns that have a missing value
df_fertility.isnull().sum()[df_fertility.isnull().sum() == 1]
It would seem the 87 columns from 2014 going forward are each missing 1 data point. We need to check whether only 1 record is missing all 87 values, or if the missing values are spread out across multiple rows.
# Extracting all rows that have at least 1 missing value in any column
df_fertility[df_fertility.isnull().any(axis=1)]
Only 1 row shows up as having missing values. This means there is no data for Greenland from 2014 onwards. Since we were going to drop all columns from 2022 going forward, we can modify that step to instead drop all columns starting from 2014. Doing this will allow us to still use the data for Greenland, without introducing the slight inaccuracies that can come from trying to replace the missing values.
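As a compact illustration of the reasoning above, the following sketch locates the rows with missing values and the first column in which they occur, then confirms that trimming before that column leaves no gaps. The dataframe is a made-up stand-in with hypothetical country labels, not the actual fertility data.

```python
import numpy as np
import pandas as pd

# Toy data in which country "B" has no values from 2014 onwards
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "2012": [2.1, 3.0, 1.8],
    "2013": [2.0, 2.9, 1.8],
    "2014": [2.0, np.nan, 1.7],
    "2015": [1.9, np.nan, 1.7],
})

# Rows that contain at least one missing value
rows_with_nulls = toy.loc[toy.isnull().any(axis=1), "country"].tolist()

# Columns that contain at least one missing value, in their original order
cols_with_nulls = toy.columns[toy.isnull().any(axis=0)].tolist()
first_null_col = cols_with_nulls[0]

# Keeping only the columns up to and including 2013 leaves no gaps
trimmed = toy.loc[:, :"2013"]
remaining_nulls = int(trimmed.isnull().sum().sum())
print(rows_with_nulls, first_null_col, remaining_nulls)
```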
We quickly double check that there would no longer be any missing values from anywhere else in the dataframe after that modification.
# Checking null values for subset of dataframe up to and including 2013
df_fertility.loc[:, :"2013"].isnull().sum().value_counts()
Next, we check for duplicate rows in our dataframe.
# The total number of duplicated rows across the dataframe
df_fertility.duplicated().sum()
We find no duplicates. But this checks for rows that are wholly duplicated and identical in all cells.
We also need to check that no individual country has been duplicated but given different values in each duplicated row.
# Checking for duplicates in country column
df_fertility["country"].nunique()
202 unique countries matches the number of total rows we have in the dataframe. This means no country is duplicated.
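The difference between the two duplicate checks can be seen on a small example. In this sketch the dataframe is made up, with a hypothetical country "B" entered twice with different values, which is the case that a whole-row check would miss.

```python
import pandas as pd

# Toy data: country "B" appears twice, with different values
toy = pd.DataFrame({
    "country": ["A", "B", "B"],
    "1991": [2.1, 3.0, 3.2],
})

# Whole-row duplicates: none here, because the two "B" rows differ in value
full_row_duplicates = int(toy.duplicated().sum())

# Duplicates judged on the country column alone: the second "B" is flagged
country_duplicates = int(toy.duplicated(subset="country").sum())

# Equivalent yes/no check on the column itself
countries_unique = toy["country"].is_unique
print(full_row_duplicates, country_duplicates, countries_unique)
```

Comparing `nunique()` against the row count, as done in this report, answers the same question as `is_unique`.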
We move on to the next dataset.
# We load the next dataset, which shows the total female employment rate by country
df_employment = pd.read_csv("females_aged_15plus_employment_rate_percent.csv")
# We show the first few lines of the dataframe to check it loaded correctly
df_employment.head()
We can see that the dataframe contains employment rate data for a number of countries, starting from the year 1991, up to 2019, which covers a period of 29 years. Based on our inspection of the fertility rate dataset earlier, we will only be interested in data up to 2013, and so for this dataframe we will also drop the columns from 2014 going forward. This dataset starts from 1991. Hence, we will need to drop columns earlier than 1991 from the fertility rate dataframe.
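The common year range can also be derived from the column labels rather than by inspection. The sketch below rebuilds only the column labels described in the text (1800 to 2100 for fertility, 1991 to 2019 for employment); the variable names are illustrative, and no data is loaded.

```python
import pandas as pd

# Column labels only; no data is needed to compare year coverage
fertility_cols = pd.Index(["country"] + [str(y) for y in range(1800, 2101)])
employment_cols = pd.Index(["country"] + [str(y) for y in range(1991, 2020)])

# Year columns present in both dataframes
shared_years = fertility_cols.intersection(employment_cols).drop("country")

# Apply the cutoff imposed by the missing fertility data (2014 onwards)
usable_years = sorted(y for y in shared_years if int(y) <= 2013)
print(usable_years[0], usable_years[-1], len(usable_years))
```

This gives 23 usable year columns, from 1991 to 2013 inclusive.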
Next, we'll check how many rows are present in this dataframe.
df_employment.shape
This tells us that there are 189 records, presumably for 189 countries, in our dataframe. We'll need to work with a final dataset which covers the same set of countries for all our indicators. As such, we will evidently need to drop some of the rows in the fertility rate dataset, and potentially in some of our other subsequent datasets as well if there are additional disparities there.
We inspect the dataframe in further detail.
df_employment.info()
We find that the data type for 29 of the columns is float, and the country column likely has the string data type.
We verify this by looking at the data type for that column below.
# Checking the data type of the first element of the country column
type(df_employment["country"][0])
The above confirms that the first column has the string data type. This is satisfactory. No data types will need to be changed for this dataframe.
We look at the summary statistics for the dataframe.
df_employment.describe()
We first check whether there are any infeasibly extreme values in any of the columns, by looking at the highest maximum value, and the lowest minimum value across the columns.
# Checking the highest value in the max row of the descriptive statistics table
df_employment.describe().loc["max"].max()
# Checking the lowest value in the min row of the descriptive statistics table
df_employment.describe().loc["min"].min()
The lowest min value is not a negative number. The highest max value is not above 100 (since the values are percentages). We conclude these are both reasonable values for the employment rate.
We then check for missing values.
df_employment.isnull().sum()
None of the columns show missing values. Just to be sure we are not overlooking any, we look at a summary count of the null values.
# Aggregated counts of the null values in the dataframe
df_employment.isnull().sum().value_counts()
This is satisfactory.
We move on to checking for duplicates.
# The total number of duplicated rows across the dataframe
df_employment.duplicated().sum()
We find no duplicates. But this checks for rows that are wholly duplicated and identical in all cells.
We also need to check that no individual country has been duplicated but given different values in each duplicated row.
# Checking for duplicates in country column
df_employment["country"].nunique()
189 unique countries matches the number of total rows we have in the dataframe. This means no country is duplicated.
We move on to the next dataset.
# We load the next dataset, which shows the percentage of female employment in agriculture by country
df_agric = pd.read_csv("female_agriculture_workers_percent_of_female_employment.csv")
# We show the first few lines of the dataframe to check it loaded correctly
df_agric.head()
We can see that the dataframe contains data about the proportion of female employment that is in the agricultural sector, for a number of countries, starting from the year 1991, up to 2019, which covers a period of 29 years. Based on our inspection of the fertility rate dataset earlier, we will only be interested in data up to 2013, and so for this dataframe we will also drop the columns from 2014 going forward.
Next, we'll check how many rows are present in this dataframe.
df_agric.shape
This tells us that there are 189 records, presumably for 189 countries, in our dataframe.
We inspect the dataframe in further detail.
df_agric.info()
We find that the data type for 29 of the columns is float, and the country column likely has the string data type.
We verify this by looking at the data type for that column below.
# Checking the data type of the first element of the country column
type(df_agric["country"][0])
The above confirms that the first column has the string data type. This is satisfactory. No data types will need to be changed for this dataframe.
We look at the summary statistics for the dataframe.
df_agric.describe()
We first check whether there are any infeasibly extreme values in any of the columns, by looking at the highest maximum value, and the lowest minimum value across the columns.
# Checking the highest value in the max row of the descriptive statistics table
df_agric.describe().loc["max"].max()
# Checking the lowest value in the min row of the descriptive statistics table
df_agric.describe().loc["min"].min()
The lowest min value is not a negative number. The highest max value is not above 100 (since the values are percentages). We conclude these are both reasonable values for the agricultural sector employment rate.
We then check for missing values.
# Checking the number of null values for each column
df_agric.isnull().sum()
None of the columns show missing values. Just to be sure we are not overlooking any, we look at a summary count of the null values.
# Aggregated counts of the null values in the dataframe
df_agric.isnull().sum().value_counts()
This is satisfactory.
We move on to checking for duplicates.
# The total number of duplicated rows across the dataframe
df_agric.duplicated().sum()
We find no duplicates. But this checks for rows that are wholly duplicated and identical in all cells.
We also need to check that no individual country has been duplicated but given different values in each duplicated row.
# Checking for duplicates in country column
df_agric["country"].nunique()
189 unique countries matches the number of total rows we have in the dataframe. This means no country is duplicated.
We move on to the next dataset.
# We load the next dataset, which shows the percentage of female employment in industry by country
df_industry = pd.read_csv("female_industry_workers_percent_of_female_employment.csv")
# We show the first few lines of the dataframe to check it loaded correctly
df_industry.head()
We can see that the dataframe contains data about the proportion of female employment that is in the industry sector, for a number of countries, starting from the year 1991, up to 2019, which covers a period of 29 years. Based on our inspection of the fertility rate dataset earlier, we will only be interested in data up to 2013, and so for this dataframe we will also drop the columns from 2014 going forward.
Next, we'll check how many rows are present in this dataframe.
df_industry.shape
This tells us that there are 189 records, presumably for 189 countries, in our dataframe.
We inspect the dataframe in further detail.
df_industry.info()
We find that the data type for 29 of the columns is float, and the country column likely has the string data type.
We verify this by looking at the data type for that column below.
# Checking the data type of the first element of the country column
type(df_industry["country"][0])
The above confirms that the first column has the string data type. This is satisfactory. No data types will need to be changed for this dataframe.
We look at the summary statistics for the dataframe.
df_industry.describe()
We first check whether there are any infeasibly extreme values in any of the columns, by looking at the highest maximum value, and the lowest minimum value across the columns.
# Checking the highest value in the max row of the descriptive statistics table
df_industry.describe().loc["max"].max()
# Checking the lowest value in the min row of the descriptive statistics table
df_industry.describe().loc["min"].min()
The lowest min value is not a negative number. The highest max value is not above 100 (since the values are percentages). We conclude these are both reasonable values for the industry sector employment rate.
We then check for missing values.
# Checking the number of null values for each column
df_industry.isnull().sum()
None of the columns show missing values. Just to be sure we are not overlooking any, we look at a summary count of the null values.
# Aggregated counts of the null values in the dataframe
df_industry.isnull().sum().value_counts()
This is satisfactory.
We move on to checking for duplicates.
# The total number of duplicated rows across the dataframe
df_industry.duplicated().sum()
We find no duplicates. But this checks for rows that are wholly duplicated and identical in all cells.
We also need to check that no individual country has been duplicated but given different values in each duplicated row.
# Checking for duplicates in country column
df_industry["country"].nunique()
189 unique countries matches the number of total rows we have in the dataframe. This means no country is duplicated.
We move on to the next dataset.
# We load the next dataset, which shows the percentage of female employment in service work by country
df_service = pd.read_csv("female_service_workers_percent_of_female_employment.csv")
# We show the first few lines of the dataframe to check it loaded correctly
df_service.head()
We can see that the dataframe contains data about the proportion of female employment that is in the service sector, for a number of countries, starting from the year 1991, up to 2019, which covers a period of 29 years. Based on our inspection of the fertility rate dataset earlier, we will only be interested in data up to 2013, and so for this dataframe we will also drop the columns from 2014 going forward.
Next, we'll check how many rows are present in this dataframe.
df_service.shape
This tells us that there are 189 records, presumably for 189 countries, in our dataframe.
We inspect the dataframe in further detail.
df_service.info()
We find that the data type for 29 of the columns is float, and the country column likely has the string data type.
We verify this by looking at the data type for that column below.
# Checking the data type of the first element of the country column
type(df_service["country"][0])
The above confirms that the first column has the string data type. This is satisfactory. No data types will need to be changed for this dataframe.
We look at the summary statistics for the dataframe.
df_service.describe()
We first check whether there are any infeasibly extreme values in any of the columns, by looking at the highest maximum value, and the lowest minimum value across the columns.
# Checking the highest value in the max row of the descriptive statistics table
df_service.describe().loc["max"].max()
# Checking the lowest value in the min row of the descriptive statistics table
df_service.describe().loc["min"].min()
The lowest min value is not a negative number. The highest max value is not above 100 (since the values are percentages). We conclude these are both reasonable values for the service sector employment rate.
We then check for missing values.
# Checking the number of null values for each column
df_service.isnull().sum()
None of the columns show missing values. Just to be sure we are not overlooking any, we look at a summary count of the null values.
# Aggregated counts of the null values in the dataframe
df_service.isnull().sum().value_counts()
This is satisfactory.
We move on to checking for duplicates.
# The total number of duplicated rows across the dataframe
df_service.duplicated().sum()
We find no duplicates. But this checks for rows that are wholly duplicated and identical in all cells.
We also need to check that no individual country has been duplicated but given different values in each duplicated row.
# Checking for duplicates in country column
df_service["country"].nunique()
189 unique countries matches the number of total rows we have in the dataframe. This means no country is duplicated.
We move on to the next dataset.
# We load the next dataset, which shows the percentage of female employment with family worker status by country
df_family = pd.read_csv("female_family_workers_percent_of_female_employment.csv")
# We show the first few lines of the dataframe to check it loaded correctly
df_family.head()
We can see that the dataframe contains data about the proportion of female employment that has family worker status, for a number of countries, starting from the year 1991, up to 2019, which covers a period of 29 years. Based on our inspection of the fertility rate dataset earlier, we will only be interested in data up to 2013, and so for this dataframe we will also drop the columns from 2014 going forward.
Next, we'll check how many rows are present in this dataframe.
df_family.shape
This tells us that there are 189 records, presumably for 189 countries, in our dataframe.
We inspect the dataframe in further detail.
df_family.info()
We find that the data type for 29 of the columns is float, and the country column likely has the string data type.
We verify this by looking at the data type for that column below.
# Checking the data type of the first element of the country column
type(df_family["country"][0])
The above confirms that the first column has the string data type. This is satisfactory. No data types will need to be changed for this dataframe.
We look at the summary statistics for the dataframe.
df_family.describe()
We first check whether there are any infeasibly extreme values in any of the columns, by looking at the highest maximum value, and the lowest minimum value across the columns.
# Checking the highest value in the max row of the descriptive statistics table
df_family.describe().loc["max"].max()
# Checking the lowest value in the min row of the descriptive statistics table
df_family.describe().loc["min"].min()
The lowest min value is not a negative number. The highest max value is not above 100 (since the values are percentages). We conclude these are both reasonable values for the family worker status employment rate.
We then check for missing values.
# Checking the number of null values for each column
df_family.isnull().sum()
None of the columns show missing values. Just to be sure we are not overlooking any, we look at a summary count of the null values.
# Aggregated counts of the null values in the dataframe
df_family.isnull().sum().value_counts()
This is satisfactory.
We move on to checking for duplicates.
# The total number of duplicated rows across the dataframe
df_family.duplicated().sum()
We find no duplicates. But this checks for rows that are wholly duplicated and identical in all cells.
We also need to check that no individual country has been duplicated but given different values in each duplicated row.
# Checking for duplicates in country column
df_family["country"].nunique()
189 unique countries matches the number of total rows we have in the dataframe. This means no country is duplicated.
We move on to the next dataset.
# We load the next dataset, which shows the percentage of female employment with self-employed worker status by country
df_self_employed = pd.read_csv("female_self_employed_percent_of_female_employment.csv")
# We show the first few lines of the dataframe to check it loaded correctly
df_self_employed.head()
We can see that the dataframe contains data about the proportion of female employment that has self-employed worker status, for a number of countries, starting from the year 1991, up to 2019, which covers a period of 29 years. Based on our inspection of the fertility rate dataset earlier, we will only be interested in data up to 2013, and so for this dataframe we will also drop the columns from 2014 going forward.
Next, we'll check how many rows are present in this dataframe.
df_self_employed.shape
This tells us that there are 189 records, presumably for 189 countries, in our dataframe.
We inspect the dataframe in further detail.
df_self_employed.info()
We find that the data type for 29 of the columns is float, and the country column likely has the string data type.
We verify this by looking at the data type for that column below.
# Checking the data type of the first element of the country column
type(df_self_employed["country"][0])
The above confirms that the first column has the string data type. This is satisfactory. No data types will need to be changed for this dataframe.
We look at the summary statistics for the dataframe.
df_self_employed.describe()
We first check whether there are any infeasibly extreme values in any of the columns, by looking at the highest maximum value, and the lowest minimum value across the columns.
# Checking the highest value in the max row of the descriptive statistics table
df_self_employed.describe().loc["max"].max()
# Checking the lowest value in the min row of the descriptive statistics table
df_self_employed.describe().loc["min"].min()
The lowest min value is not a negative number. The highest max value is not above 100 (since the values are percentages). We conclude these are both reasonable values for the self-employed worker status employment rate.
We then check for missing values.
# Checking the number of null values for each column
df_self_employed.isnull().sum()
None of the columns show missing values. Just to be sure we are not overlooking any, we look at a summary count of the null values.
# Aggregated counts of the null values in the dataframe
df_self_employed.isnull().sum().value_counts()
This is satisfactory.
We move on to checking for duplicates.
# The total number of duplicated rows across the dataframe
df_self_employed.duplicated().sum()
We find no duplicates. But this checks for rows that are wholly duplicated and identical in all cells.
We also need to check that no individual country has been duplicated but given different values in each duplicated row.
# Checking for duplicates in country column
df_self_employed["country"].nunique()
189 unique countries matches the number of total rows we have in the dataframe. This means no country is duplicated.
All our datasets have been loaded. We move on to the next subsection.
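Since the same checks were repeated for every dataset, they could in principle be collected into one helper. The following is a minimal sketch of that idea, not the approach used in this report: the function name `summarize_checks`, its `key` parameter, and the toy dataframe are all hypothetical.

```python
import pandas as pd

def summarize_checks(df, key="country"):
    """Return the basic data-quality facts checked by hand for each dataframe."""
    numeric = df.select_dtypes(include="number")
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "non_numeric_columns": int(df.shape[1] - numeric.shape[1]),
        "min_value": numeric.min().min(),
        "max_value": numeric.max().max(),
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_keys": int(df[key].duplicated().sum()),
    }

# Toy stand-in for one of the employment dataframes (made-up values)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "1991": [10.0, 55.5, 80.2],
    "1992": [11.0, 54.0, 79.9],
})

summary = summarize_checks(toy)
print(pd.Series(summary))
```

Calling such a helper once per dataframe would produce one comparable summary per dataset, at the cost of hiding the intermediate tables shown above.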
In this subsection, we carry out the data cleaning steps we identified in the first subsection above.
We first define some custom functions that we'll use repeatedly throughout the rest of this section.
# Defining functions to extract column names to drop
def select_col_before(df, year):
    '''
    In this function, year is excluded from the dropped columns.
    This function takes a dataframe and a column name,
    and selects column names before and excluding the given name, except the first column,
    which we keep because it contains the country names
    '''
    # Getting the index of the given column name
    mask_before = df.columns.get_loc(year)
    # Slicing dataframe using index
    return df.columns[1:mask_before]

def select_col_after(df, year):
    '''
    In this function, year is included in the dropped columns.
    This function takes a dataframe and a column name,
    and selects column names after and including the given name
    '''
    # Getting the index of the given column name
    mask_after = df.columns.get_loc(year)
    # Slicing dataframe using index
    return df.columns[mask_after:]
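To make the inclusive and exclusive boundaries concrete, here is a small usage sketch. It restates the two functions in condensed form so that it runs on its own, and applies them to an empty toy dataframe whose column labels are chosen purely for illustration.

```python
import pandas as pd

def select_col_before(df, year):
    # Columns strictly before `year`, skipping the first (country) column
    return df.columns[1:df.columns.get_loc(year)]

def select_col_after(df, year):
    # Columns from `year` onwards, including `year` itself
    return df.columns[df.columns.get_loc(year):]

# Empty toy dataframe; only the column labels matter here
toy = pd.DataFrame(columns=["country", "1989", "1990", "1991", "1992", "2013", "2014", "2015"])

before = list(select_col_before(toy, "1991"))
after = list(select_col_after(toy, "2014"))
kept = list(toy.drop(columns=before + after).columns)
print(before, after, kept)
```

The country column and the years 1991 to 2013 survive, while 1991 itself is kept and 2014 itself is dropped.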
We begin with the first dataset. We will drop the columns from 1800 to 1990, and from 2014 to 2100.
# Concatenation of the different slices of the dataframe that we would like to drop
# We use our custom functions to select the ranges before 1991 and after 2013
col_to_drop = np.r_[select_col_before(df_fertility, "1991"), select_col_after(df_fertility, "2014")]
col_to_drop
# Dropping the selected columns
df_fertility = df_fertility.drop(col_to_drop, axis = 1)
df_fertility.head()
We can see that only the desired columns remain in our first dataframe. We now need to drop the columns from 2014 onwards for the rest of our dataframes.
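An alternative to listing the columns to drop is to select the columns to keep with a label slice. This is a sketch of that alternative on made-up data, not the method used above; note that label slices in `.loc` include both endpoints.

```python
import pandas as pd

# Toy data with years on either side of the range we want to keep
toy = pd.DataFrame({
    "country": ["A", "B"],
    "1990": [5.0, 6.0],
    "1991": [4.9, 5.9],
    "2013": [3.0, 4.0],
    "2014": [2.9, 3.9],
})

# Label slices in .loc include both endpoints, so "2013" is kept
years_kept = toy.loc[:, "1991":"2013"]

# Re-attach the country column in front of the selected years
trimmed = pd.concat([toy[["country"]], years_kept], axis=1)
kept_columns = trimmed.columns.tolist()
print(kept_columns)
```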
# Lists to hold dataframes and the dataframe variable names
dfs = [df_employment, df_agric, df_industry, df_service, df_family, df_self_employed]
df_names = ["df_employment", "df_agric", "df_industry", "df_service", "df_family", "df_self_employed"]
# Loop iterates over dataframes, dropping the stated columns in place using our custom function,
# and showing the head of the result so we check that the columns were dropped correctly
for name, df in zip(df_names, dfs):
    df.drop(select_col_after(df, "2014"), axis = 1, inplace = True)
    print(name)
    display(df.head())
    print()
All our dataframes have been trimmed down to the desired range of columns. We move on to checking for disparities in the number of countries between our different dataframes.
We first check whether the dataframes containing employment data contain the same set of countries by comparing all 189 rows in all 6 dataframes.
# Loop for iterating through dataframes and checking whether country columns are identical to the country column
# of the first employment dataframe
for name, df in zip(df_names, dfs):
    matching = df["country"].equals(dfs[0]["country"])
    print(name, "country column matches?", matching)
This indicates that all the dataframes above have data on the same set of countries.
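It is worth noting that `equals` is a strict test: it requires the same values in the same positions. The sketch below, using made-up country lists, shows how a set comparison could be used as a fallback when two columns hold the same countries in a different order.

```python
import pandas as pd

countries_a = pd.Series(["Ghana", "Kenya", "Peru"])
countries_b = pd.Series(["Kenya", "Ghana", "Peru"])  # same countries, different order

# equals() is strict: same values in the same positions
strict_match = countries_a.equals(countries_b)

# Comparing as sets ignores order and reports only membership differences
same_members = set(countries_a) == set(countries_b)
only_in_a = sorted(set(countries_a) - set(countries_b))
print(strict_match, same_members, only_in_a)
```

In our case the strict test already passes, so the countries agree in both membership and order.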
We know the fertility rate dataframe has data for 202 countries, and the employment rate dataframe has data for 189 countries. So there are at least 13 additional countries that need to be dropped from the fertility rate dataframe. We'll also need to check how many of the remaining 189 countries then match the employment rate dataframes.
First, we show below the list of countries for which we have fertility rate data but don't have employment data.
# The mask selects the entries in the country column of the fertility rate dataframe that are not in the same column
# of the employment dataframe
countries_to_drop = df_fertility["country"][~df_fertility["country"].isin(df_employment["country"])]
countries_to_drop
These countries correspond to the following rows in the fertility rate dataframe:
df_fertility[df_fertility["country"].isin(countries_to_drop)]
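The same set of unmatched countries could also be found with a left join and the `indicator` option of `merge`. The sketch below compares that approach with the `isin` mask on toy dataframes; the names `fertility` and `employment` and their values are made up for illustration.

```python
import pandas as pd

fertility = pd.DataFrame({"country": ["A", "B", "C", "D"], "1991": [2.0, 3.0, 4.0, 5.0]})
employment = pd.DataFrame({"country": ["A", "C"], "1991": [40.0, 60.0]})

# Left-join on country; the _merge column records where each row was found
flagged = fertility.merge(employment[["country"]], on="country", how="left", indicator=True)

# "left_only" rows exist in the fertility data but not in the employment data
unmatched = flagged.loc[flagged["_merge"] == "left_only", "country"].tolist()

# The isin mask used in this report gives the same answer
via_isin = fertility["country"][~fertility["country"].isin(employment["country"])].tolist()
print(unmatched, via_isin)
```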
We drop these countries from the fertility rate dataframe.
df_fertility = df_fertility.drop(countries_to_drop.index, axis = 0)
df_fertility.head(10)
df_fertility.tail(10)
We check the number of rows now in the dataframe.
df_fertility.shape
We now need to reset the index of the dataframe.
df_fertility = df_fertility.reset_index(drop = True)
display(df_fertility.head())
display(df_fertility.tail())
We check whether all the countries in the dataframe now correspond to the countries in the other dataframes.
df_fertility["country"].equals(df_employment["country"])
The remaining 189 countries match the 189 countries for which we have employment data.
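Resetting the index matters for this comparison because `equals` takes the index labels into account as well as the values. A minimal sketch on made-up data illustrates this; the names `reference`, `full` and `filtered` are hypothetical.

```python
import pandas as pd

reference = pd.Series(["A", "C", "D"], name="country")

full = pd.DataFrame({"country": ["A", "B", "C", "D"]})
filtered = full.drop(index=[1])  # index labels are now 0, 2, 3

# Same values, but the index labels differ, so equals() reports False
match_before_reset = filtered["country"].equals(reference)

# After resetting, index labels are 0, 1, 2 again and the comparison succeeds
filtered = filtered.reset_index(drop=True)
match_after_reset = filtered["country"].equals(reference)
print(match_before_reset, match_after_reset)
```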
This concludes our data cleaning.
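With the dataframes now aligned on countries and years, one possible way to combine two of them is to reshape each from wide to long form and merge on country and year. This is only a sketch under our own assumptions: the helper `to_long`, the value column names, and the toy values are hypothetical and are not taken from the analysis in this report.

```python
import pandas as pd

fertility = pd.DataFrame({"country": ["A", "B"], "1991": [5.0, 2.0], "1992": [4.8, 2.0]})
employment = pd.DataFrame({"country": ["A", "B"], "1991": [40.0, 55.0], "1992": [41.0, 56.0]})

def to_long(df, value_name):
    # One row per (country, year) pair instead of one column per year
    return df.melt(id_vars="country", var_name="year", value_name=value_name)

# Inner merge keeps the (country, year) pairs present in both dataframes
combined = to_long(fertility, "fertility_rate").merge(
    to_long(employment, "employment_rate"), on=["country", "year"]
)
print(combined)
```

In long form, each row pairs a fertility rate with an employment rate for the same country and year, which is a convenient shape for scatter plots and correlations.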
In this section, we will compute relevant statistics and create relevant visualizations for our data. We will then go on to address each of our research questions.
We first need to view the descriptive statistics for all seven of our dataframes. We define a custom function.
def plot_stats(df_list, stat_titles):
    '''
    This function requires that the input arguments be of equal length.
    This function outputs the descriptive statistics for a given list of dataframes.
    '''
    # Getting the current Pandas setting for maximum number of columns to display
    # Default is usually 20 and could be reset to that at the end, but instead
    # we are storing current value in case code is run on system with custom value and not default
    current_col_max = pd.get_option("display.max_columns")
    # Increasing the maximum number of columns to display to 25
    # so that all statistics for our 24 columns can be viewed from the tables
    pd.set_option("display.max_columns", 25)
    # Loop iterates through the lists of dataframes and titles, computing and displaying summary statistics
    for df, title in zip(df_list, stat_titles):
        print("Descriptive statistics for {} data:".format(title))
        display(df.describe())
    # Resetting the maximum number of columns to display to previous value
    pd.set_option("display.max_columns", current_col_max)
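The pattern of saving, changing and restoring a display option can also be expressed with `pd.option_context`, which restores the previous value automatically when the block exits. The sketch below shows that alternative on a made-up dataframe with 23 year columns; it is not the approach used in the function above.

```python
import pandas as pd

# Toy dataframe with one column per year from 1991 to 2013 (made-up values)
toy = pd.DataFrame({str(year): [1.0, 2.0] for year in range(1991, 2014)})

max_cols_before = pd.get_option("display.max_columns")

# Inside the with-block the option is raised; it reverts automatically on exit
with pd.option_context("display.max_columns", 25):
    max_cols_inside = pd.get_option("display.max_columns")
    stats = toy.describe()

max_cols_after = pd.get_option("display.max_columns")
print(max_cols_inside, max_cols_before == max_cols_after, stats.shape)
```

One advantage of the context manager is that the option is restored even if an error occurs inside the block.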
# Lists containing all dataframes and the desired titles
all_dfs = [df_fertility, df_employment, df_agric, df_industry, df_service, df_family, df_self_employed]
stats_titles = ["Fertility Rate", "Female Employment", "Female Employment in Agriculture Sector",
"Female Employment in Industry Sector", "Female Employment in Service Sector",
"Female Employment as Family Workers", "Female Employment as Self-employed Workers"]
We view the descriptive statistics for the first dataframe.
plot_stats([all_dfs[0]], [stats_titles[0]])
For the fertility rate data, the mean and median values for children per woman seem to steadily decrease from the beginning throughout the entire period. So does the maximum number of children per woman. These observations may point to a trend of decreasing fertility rate, generalized across countries. The standard deviation also decreases throughout the period, indicating that there is lower variability in the fertility rate between different countries in more recent years than in earlier years.
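Trends like these can be read more easily if the relevant rows of the summary table are transposed so that each year becomes a row. The sketch below does this on made-up values and then checks whether each statistic is non-increasing from year to year; the names `trend` and `non_increasing` are illustrative.

```python
import pandas as pd

# Toy data in which values fall or stay flat over time
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "1991": [6.0, 4.0, 2.0],
    "1992": [5.0, 3.5, 2.0],
    "1993": [4.0, 3.0, 2.0],
})

# Rows of describe() become columns, giving one row per year
trend = toy.describe().loc[["mean", "50%", "std", "max"]].T

# True when a statistic never increases from one year to the next
non_increasing = (trend.diff().dropna() <= 0).all()
print(trend)
print(non_increasing)
```

A check like this only confirms the direction of change; it says nothing about whether the change is large enough to be meaningful.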
We look at the descriptive statistics for the next dataset.
plot_stats([all_dfs[1]], [stats_titles[1]])
For the total female employment rate data, the mean and median values increase slightly over the period under study. The standard deviation decreases slightly over the period, indicating decreasing variability in the female employment rates between different countries.
We look at the descriptive statistics for the next dataset.
plot_stats([all_dfs[2]], [stats_titles[2]])
For the employment rate data for the agriculture sector, the statistics show an overall decrease in female employment rate across countries, as per the mean and median values. There is considerable variability between countries: relative to the mean female employment rate, the standard deviation starts out almost as large and eventually becomes larger.
plot_stats([all_dfs[3]], [stats_titles[3]])
For the employment rate data for the industry sector, there is also an overall decrease in female employment rate across countries, as can be seen by looking at the mean and median values. Variability is relatively unchanging over time, with the ratio of the standard deviation to the mean hovering around 2/3 throughout.
plot_stats([all_dfs[4]], [stats_titles[4]])
The values of the mean and median female employment rate in the service sector each increase significantly over the time period. Even the minimum female employment rate in this sector rises significantly, showing that even the countries with the lowest values over the years are still seeing an overall increase in female employment. The standard deviation, however, doesn't vary to the same degree as the mean, indicating a decrease in relative variability over time.
plot_stats([all_dfs[5]], [stats_titles[5]])
For the female employment rate data for family workers, there is a gradual decrease in average employment rate, as per the trends in the mean and median values. We also see significant variability, with the standard deviation consistently larger than the mean, and decreasing at a slower pace than the mean female employment rate.
plot_stats([all_dfs[6]], [stats_titles[6]])
The female employment rate for self-employed workers also shows an overall decrease over the duration of the period, with only a very slight decrease in the standard deviation.
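Several of the observations above compare the standard deviation to the mean within a year. A minimal sketch of how that ratio could be computed for every year at once is shown below. The helper name `yearly_cv` and the toy dataframe are hypothetical, and the values are made up rather than taken from the Gapminder data.

```python
import pandas as pd

def yearly_cv(df):
    """Return std / mean for each numeric (year) column of a wide dataframe."""
    # describe() summarises numeric columns only, so "country" is ignored
    stats = df.describe()
    return stats.loc["std"] / stats.loc["mean"]

# Made-up values in the same wide layout: one row per country, one column per year
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "1991": [10.0, 20.0, 30.0],
    "2013": [5.0, 10.0, 15.0],
})

cv_by_year = yearly_cv(toy)
print(cv_by_year)
```

A ratio that rises over the years would suggest that the spread between countries is growing relative to the average, even if the standard deviation itself is falling.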
We would now like to construct a single dataframe containing the data that we will use to answer our research questions. Ideally, this will be a dataframe with data for all countries, for all 7 of our indicators, for only one year. However, for our conclusions to be valid and meaningful, the year we elect to use from our data needs to be as representative of the entire time period as possible. Thus, we will now investigate whether it is feasible to extract one year that is representative of the data, for all indicators. The year we select needs to be representative of the whole in two ways: the fluctuations/movement of the data, and the values of the data.
We first look at the correlation of the years to each other, for each indicator. This will help us determine whether a single year of data can be representative of the movement of all other years of data. We will do this using a scatter plot matrix for each indicator. For efficiency, we will define a custom function to use for the plots.
def plot_scatters(df_list, scatter_titles):
    '''
    This function requires that the two input arguments be of equal length.
    This function takes a list of dataframes and plot titles, and plots the scatter matrix for each dataframe,
    labelling it with the requisite title.
    '''
    # This loop iterates through the list of dataframes.
    for df, title in zip(df_list, scatter_titles):
        # Plotting scatter matrix with specified size
        pd.plotting.scatter_matrix(df, figsize=(30,30));
        # Customising title to dataframe in current iteration
        current_title = "Scatter Plot Matrix for {} Data, by Year".format(title)
        # Text formatting and placement settings for the title
        plt.suptitle(current_title, y = 0.9, weight = "bold", size = 30);
We plot the scatter matrix for the first dataframe. We make use of our custom function and the list of plot titles.
plot_scatters([all_dfs[0]], [stats_titles[0]])
The histograms for the fertility rate data show a positive skew. There is a very high positive correlation between the years, based on the scatter plots.
plot_scatters([all_dfs[1]], [stats_titles[1]])
The histograms for the total female employment rate data show a symmetrical, likely normal, distribution for all years. We also see a very high positive correlation between all years.
plot_scatters([all_dfs[2]], [stats_titles[2]])
The histograms for female employment rate in the agriculture sector all show a positive skew. All years in the period also show a very high positive correlation to each other.
plot_scatters([all_dfs[3]], [stats_titles[3]])
The histograms for female employment rate in the industry sector all show a positive skew. The correlation between the years is positive, as seen in the scatter plots, and it is again very high.
plot_scatters([all_dfs[4]], [stats_titles[4]])
The female employment rate data for the service sector show a negative skew. There is a very high positive correlation between the years over the whole period.
plot_scatters([all_dfs[5]], [stats_titles[5]])
The female employment rate for the family worker status shows a positive skew. The scatter plots show a very high positive correlation.
plot_scatters([all_dfs[6]], [stats_titles[6]])
The female employment rate for self-employed status shows possibly bimodal distributions, some of which have a stark positive skew. There is also very high positive correlation between the years.
The patterns in the spreads of the scatter plots for all our indicators suggest that each year has the highest correlation with the years nearest to it in either direction. Even though the correlation between any two years decreases slightly as the years get further apart, the differences are so small that the correlation remains very high. This can be seen by looking at the correlation between the two years at the extreme ends of our period of study, 1991 and 2013, for all indicators. Therefore, it is reasonable to say that any of the years in the period is representative of the movement of the indicator across all other years for all countries.
We verify this conclusion by looking at heatmaps of correlation matrices, one for each dataframe. We will annotate each heatmap with the value of each correlation, rounded to two significant figures (the default annotation format in seaborn).
We start by defining a custom function to plot the correlation matrix heatmaps for our dataframes with the desired parameters and formatting.
def plot_corr_heatmaps(df_list, heatmap_title_list):
    '''
    The length of the two input arguments to this function must be the same.
    This function takes a list of dataframes and a list of plot titles,
    and plots correlation matrix heatmaps for each dataframe.
    '''
    for df, title in zip(df_list, heatmap_title_list):
        plt.figure(figsize = (15, 10))
        # Only the numeric (year) columns are correlated; the text "country" column is
        # excluded explicitly, since recent pandas versions may raise an error otherwise
        sns.heatmap(df.select_dtypes(include = "number").corr(), annot = True, cmap = "PiYG")
        plt.title("Correlation Matrix Heatmap for {} Data, by Year".format(title))
        plt.show()
We now plot the heatmap for our first dataframe.
plot_corr_heatmaps([all_dfs[0]], [stats_titles[0]])
The heatmap shows that each year has a near-perfect positive correlation (displayed as 1 after rounding) with the 4 or so years closest to it in either direction. The lowest value, for the correlation between 1991 and 2013, is 0.9, which is very high.
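A correlation annotated as 1 in a heatmap is not necessarily exactly 1, because the annotations are rounded. seaborn's `heatmap` formats annotations with `fmt = ".2g"` by default, which keeps two significant figures. The sketch below uses Python's built-in `format` with made-up correlation values to illustrate the effect; the values are assumptions chosen only for illustration.

```python
# Hypothetical correlation values, chosen only to illustrate rounding
values = [0.996, 0.9949, 0.9]

# For each value: (two significant figures, three decimal places)
labels = {value: (format(value, ".2g"), format(value, ".3f")) for value in values}

for value, (two_sig_figs, three_decimals) in labels.items():
    print(value, "->", two_sig_figs, "|", three_decimals)
```

Passing a format such as `fmt = ".3f"` to `sns.heatmap` would be one way to distinguish values that are very close to 1 from values that are exactly 1.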
We plot the heatmap for the next dataframe.
plot_corr_heatmaps([all_dfs[1]], [stats_titles[1]])
The heatmap shows that each year has a near-perfect positive correlation (displayed as 1 after rounding) with the 2 or so years closest to it in either direction. The lowest value, for the correlation between 1991 and 2013, is 0.91, which is very high.
We plot the heatmap for the next dataframe.
plot_corr_heatmaps([all_dfs[2]], [stats_titles[2]])
The heatmap shows that each year has a near-perfect positive correlation (displayed as 1 after rounding) with the 3 or so years closest to it in either direction. The lowest value, for the correlation between 1991 and 2013, is 0.96, which is very high.
We plot the heatmap for the next dataframe.
plot_corr_heatmaps([all_dfs[3]], [stats_titles[3]])
The heatmap shows that each year has a near-perfect positive correlation (displayed as 1 after rounding) with the year closest to it in either direction. The lowest value, for the correlation between 1991 and 2013, is 0.77, which is fairly high.
We plot the heatmap for the next dataframe.
plot_corr_heatmaps([all_dfs[4]], [stats_titles[4]])
The heatmap shows that each year has a near-perfect positive correlation (displayed as 1 after rounding) with the 3 or so years closest to it in either direction. The lowest value, for the correlation between 1991 and 2013, is 0.95, which is very high.
We plot the heatmap for the next dataframe.
plot_corr_heatmaps([all_dfs[5]], [stats_titles[5]])
The heatmap shows that each year has a near-perfect positive correlation (displayed as 1 after rounding) with the year closest to it in either direction. The lowest value, for the correlation between 1991 and 2013, is 0.87, which is very high.
We plot the heatmap for the next dataframe.
plot_corr_heatmaps([all_dfs[6]], [stats_titles[6]])
The heatmap shows that each year has a near-perfect positive correlation (displayed as 1 after rounding) with the 4 or so years closest to it in either direction. The lowest value, for the correlation between 1991 and 2013, is 0.97, which is very high.
The correlation matrix heatmaps for our indicators support the initial observations we made visually from the scatter plots above. Even the very lowest values still represent high positive correlation. This confirms that there is high correlation across the years in the entire period. Based on these results, we tentatively select the latest year in the period, 2013, to focus our analysis on.
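The lowest correlation in each matrix was read off the heatmaps by eye. As a cross-check, it could also be located programmatically. The following is a minimal sketch under the assumption that each dataframe has one text `country` column and one numeric column per year; the helper name `weakest_year_pair` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def weakest_year_pair(df):
    """Return (year_a, year_b, r) for the lowest pairwise correlation between year columns."""
    corr = df.select_dtypes(include="number").corr()
    values = corr.to_numpy().copy()
    # Ignore the diagonal, where every year correlates perfectly with itself
    np.fill_diagonal(values, np.inf)
    i, j = np.unravel_index(np.nanargmin(values), values.shape)
    return corr.index[i], corr.columns[j], values[i, j]

# Made-up values, not Gapminder data
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "1991": [1.0, 2.0, 3.0, 4.0],
    "2002": [1.0, 2.0, 4.0, 3.0],
    "2013": [2.0, 1.0, 4.0, 3.0],
})

year_a, year_b, r = weakest_year_pair(toy)
print(year_a, year_b, round(r, 2))
```

If the weakest pair for an indicator is not the two ends of the period, that would be worth noting before settling on a single representative year.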
We are now reasonably confident that our chosen year is representative of the movement of the data for all indicators, across all countries, over the entire 23-year period. We now need to check, for each indicator, whether the values of the data for the year 2013 are representative of the values of the data across the entire 23-year period, for all the countries represented in our data.
To do this, we would like to compare the values of the data for the year 2013 against the values of the average across the period, for each country, and for each indicator.
We first create dataframes that have the countries as columns and the years as rows, by reindexing and transposing our current dataframes. For each dataframe, we make the country column the new index, transpose the reindexed dataframe, and assign the result to a new variable.
transposed_fertility = df_fertility.set_index("country").transpose()
transposed_employment = df_employment.set_index("country").transpose()
transposed_agric = df_agric.set_index("country").transpose()
transposed_industry = df_industry.set_index("country").transpose()
transposed_service = df_service.set_index("country").transpose()
transposed_family = df_family.set_index("country").transpose()
transposed_self_employed = df_self_employed.set_index("country").transpose()
We now need to determine which computation of "average", as used above, is most suitable for our use case. The median is a more appropriate measure of central tendency for data with a skewed distribution. The mean is more appropriate for data with a normal distribution. We would like to check which category our data fall into by looking at the distributions of the new transposed dataframes for each indicator.
We could start by plotting histograms of the data, and do a visual inspection of the plots for each indicator to see what that yields. However, we have 189 countries per dataframe, and across 7 indicators that would result in over a thousand histogram plots to analyse. This is an unwieldy volume to accurately inspect visually to a level of thoroughness that warrants a firm conclusion. Therefore, we will instead programmatically compute the skewness for each country for each indicator.
A skewness between -0.5 and 0.5 indicates that the distribution is almost symmetrical. We will extract a count, for each indicator, of how many countries show symmetrical distribution based on the skewness values. The proportion of countries for each indicator that shows a symmetrical distribution will determine which measure of central tendency we go on to use.
The for loop below uses the skew() function from Pandas, iterating through a list of the transposed dataframes. The result from this function is as long as the number of columns in the dataframe, 189. This is too long and cumbersome to print, and will return all levels of skewness, without directly giving us the final result we require. Thus, we will use a mask to slice from the output of the function only the values that show symmetry, i.e. have an absolute value of 0.5 or less, and then we will print out the total number of countries meeting that criterion for each indicator.
# List to hold the transposed dataframes for use in a loop
transposed_dfs = [transposed_fertility, transposed_employment, transposed_agric, transposed_industry,
transposed_service, transposed_family, transposed_self_employed]
print("NUMBER OF COUNTRIES, FOR EACH INDICATOR, WITH SKEWNESS VALUE BETWEEN -0.5 AND 0.5:\n")
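for df, title in zip(transposed_dfs, stats_titles):
    # Skewness of each country's values over the period, computed once per indicator
    skewness = df.skew(axis = 0)
    print(title)
    # Mask keeps only the countries with an absolute skewness of 0.5 or less
    print(skewness[skewness.abs() <= 0.5].count())
    print()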
We find that almost all the indicators have roughly an even split between countries which have a symmetrical distribution and countries which don't. As such, we will make our comparisons using both the median and the mean in our visualizations, to adequately cater for both kinds of distributions present in our data, and for completeness.
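The counting step can be checked on a small made-up example where the answer is known in advance. The sketch below uses a hypothetical dataframe with one symmetric column and one strongly right-skewed column, and also reports the share of symmetric columns, which is the quantity the decision above rests on.

```python
import pandas as pd

# Made-up series: one symmetric column and one strongly right-skewed column
toy = pd.DataFrame({
    "Symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Skewed": [1.0, 1.0, 1.0, 1.0, 10.0],
})

skewness = toy.skew(axis=0)
# True where the column counts as roughly symmetrical under the 0.5 threshold
is_symmetric = skewness.abs() <= 0.5
n_symmetric = int(is_symmetric.sum())
share_symmetric = n_symmetric / len(skewness)

print(skewness)
print("Symmetric columns:", n_symmetric, "share:", share_symmetric)
```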
For each indicator, we now compare the values for the year 2013 against both the median and the mean of the values since 1991, for each of our 189 countries. We make use of line plots for this comparison, and define a custom function for efficiency.
def plot_comparison_lines(df_list, lines_titles):
    '''
    This function takes a list of transposed dataframes (years as rows, countries
    as columns) and a list of plot titles, and for each dataframe, plots the line
    graphs for the mean for 1991-2013, the median for 1991-2013, and the 2013 values
    '''
    for df, title in zip(df_list, lines_titles):
        plt.figure(figsize = (35, 35))
        # Plot the median value of the data for each country (column)
        plt.plot(df.median(axis = 0), color = "blue", label = "Median of 1991-2013 Data");
        # Plot the mean value of the data for each country (column)
        plt.plot(df.mean(axis = 0), color = "c", label = "Mean of 1991-2013 Data");
        # Plot the data in the last row, which is the 2013 row
        plt.plot(df.iloc[-1,:], color = "m", label = "2013 Data");
        plt.xticks(rotation = 90);
        plt.yticks(size = 12)
        plt.title("{}: Comparison between 2013 data and mean and median of 1991-2013 data".format(title), size = 30)
        plt.legend(fontsize = "xx-large")
We use our custom function to plot the first set of line plots.
plot_comparison_lines([transposed_dfs[0]], [stats_titles[0]])
The line plots for fertility rate data show that the median and mean values for each country are very similar, with the graphs superimposed onto each other for many of the values. Additionally, the values for 2013 alone are not very dissimilar to either the mean or the median across the entire period.
plot_comparison_lines([transposed_dfs[1]], [stats_titles[1]])
The line plots for total female employment rate data show that the median and mean values for each country are very similar across the countries. Additionally, the values for 2013 alone are not very dissimilar to either the mean or the median across the entire period.
plot_comparison_lines([transposed_dfs[2]], [stats_titles[2]])
The line plots for female employment rate data in the agriculture sector show that the median and mean values for each country are very similar, with the graphs almost indistinguishable from each other for many of the values. Additionally, the values for 2013 alone are not very dissimilar to either the mean or the median across the entire period.
plot_comparison_lines([transposed_dfs[3]], [stats_titles[3]])
The line plots for female employment rate data in the industry sector show that the median and mean values for each country are consistently very similar to each other. Additionally, the values for 2013 alone are not very dissimilar to either the mean or the median across the entire period.
plot_comparison_lines([transposed_dfs[4]], [stats_titles[4]])
The line plots for female employment rate data in the service sector show that the median and mean values for each country are very similar along the length of the axis. Additionally, the values for 2013 alone are not very dissimilar to either the mean or the median across the entire period.
plot_comparison_lines([transposed_dfs[5]], [stats_titles[5]])
The line plots for female employment rate data for family workers show that the median and mean values for each country are again very similar. Additionally, the values for 2013 alone are not very dissimilar to either the mean or the median across the entire period.
plot_comparison_lines([transposed_dfs[6]], [stats_titles[6]])
For our final dataframe, once more the line plots, this time for female employment rate data for self-employed workers, show that the median and mean values for each country are very similar. Additionally, the values for 2013 alone are not very dissimilar to either the mean or the median across the entire period.
Based on our observations showing consistency for all the line plots, we conclude that the values for 2013 are representative of the values for the overall period under study. We will now use the values for this year to construct our new dataframe for analysis.
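With 189 countries on each axis, small gaps between the lines are hard to judge by eye. A numeric summary could complement the plots. The sketch below is one possible approach, assuming the transposed layout used above (years as the index, countries as columns); the helper name `gap_from_period_average` and the toy values are hypothetical.

```python
import pandas as pd

def gap_from_period_average(transposed_df, year="2013"):
    """Absolute gap between one year's values and the period mean and median, per country."""
    year_values = transposed_df.loc[year]
    return pd.DataFrame({
        "abs_gap_mean": (year_values - transposed_df.mean(axis=0)).abs(),
        "abs_gap_median": (year_values - transposed_df.median(axis=0)).abs(),
    })

# Made-up values: years as rows, countries as columns
toy = pd.DataFrame(
    {"A": [6.0, 5.0, 4.0], "B": [2.0, 2.0, 2.0]},
    index=["1991", "2002", "2013"],
)

gaps = gap_from_period_average(toy)
print(gaps)
# Summary statistics of the gaps across countries
print(gaps.describe())
```

Large gaps concentrated in a few countries would suggest that 2013 is less representative for those countries than the overall plots imply.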
We extract the 2013 column from each of our 7 dataframes holding data on our indicators, and assign the countries as the index to this new dataframe.
# The column names of our new dataframe
final_columns = ["Fertility", "Employment", "Agriculture", "Industry", "Service", "Family", "Self_employed"]
# The index of our new dataframe, the list of countries
final_index = df_fertility["country"]
# Creating a new dataframe, with the required dimensions, column names and index
df_2013 = pd.DataFrame(index = final_index, columns = final_columns)
# Loop iterates through the new dataframe, populating it with the appropriate column data
for i in np.arange(len(df_2013.columns)):
    # For the column of each indicator in the new dataframe, the 2013 column data of the
    # corresponding dataframe for the indicator is retrieved and assigned to the column
    df_2013[df_2013.columns[i]] = all_dfs[i]["2013"].values
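Assigning by `.values` relies on every dataframe listing the countries in the same row order. An alternative that aligns on the country name instead is sketched below. It assumes each dataframe has a `country` column and a `"2013"` column; the helper name `build_year_frame` and the toy dataframes are hypothetical and not part of the analysis above.

```python
import pandas as pd

def build_year_frame(frames, names, year="2013"):
    """Combine one year's column from several wide dataframes, aligned on country."""
    columns = [
        df.set_index("country")[year].rename(name)
        for df, name in zip(frames, names)
    ]
    # concat with axis=1 matches rows by index label, not by position
    return pd.concat(columns, axis=1)

# Made-up data with deliberately different row orders
toy_fertility = pd.DataFrame({"country": ["A", "B"], "2013": [1.5, 4.0]})
toy_employment = pd.DataFrame({"country": ["B", "A"], "2013": [60.0, 45.0]})

toy_2013 = build_year_frame([toy_fertility, toy_employment], ["Fertility", "Employment"])
print(toy_2013)
```

With this approach, a country missing from one dataset would show up as a null value rather than silently shifting the rows.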
We check the structure of our new dataframe to ensure it has been populated correctly and is as desired.
df_2013.shape
df_2013.head()
df_2013.tail()
We check for null values in the new dataframe.
df_2013.isnull().sum()
There are no null values, and this is satisfactory. We also look at the descriptive statistics, histograms and skewness of the new dataframe. We use a custom function for this.
def plot_hists_stats_skewness(df, title):
    '''
    This function takes a single column of a dataframe (a Series) and a single
    string that will be used for both the plot title and the x-axis label. The
    function shows descriptive statistics, a histogram plot, and skewness for the column.
    '''
    display(df.describe())
    ax1 = df.hist();
    ax1.set_xlabel(title);
    ax1.set_ylabel("Number");
    ax1.set_title("{} for 2013".format(title));
    plt.show();
    print("Skewness: ", df.skew())
We use our custom function for the first column.
plot_hists_stats_skewness(df_2013["Fertility"], stats_titles[0])
The count, minimum and maximum values are as we expect and have reasonable values. The distribution for fertility rate in the histogram has a distinct positive skew. This is confirmed by the high value of the skewness.
We look at the next column.
plot_hists_stats_skewness(df_2013["Employment"], stats_titles[1])
The count, minimum and maximum values are as we expect and have reasonable values. The data for total female employment rate have a distribution that is close to a normal distribution, with a slight leaning towards a negative skew. The value of the skewness supports this conclusion.
We look at the next column.
plot_hists_stats_skewness(df_2013["Agriculture"], stats_titles[2])
The count, minimum and maximum values are as we expect and have reasonable values. The distribution for female employment rate in the agriculture sector has a positive skew. This is supported by another high value of skewness.
We move on to the next column.
plot_hists_stats_skewness(df_2013["Industry"], stats_titles[3])
The count, minimum and maximum values are as we expect and have reasonable values. The distribution for female employment rate in the industry sector shows a distinct positive skew. The very high skewness value, which is greater than 1, reflects the degree of the skew.
We look at the next column.
plot_hists_stats_skewness(df_2013["Service"], stats_titles[4])
The count, minimum and maximum values are as we expect and have reasonable values. The distribution for female employment rate in the service sector shows a negative skew, though the slope seems gradual. The skewness value is right on the edge between what is considered symmetrical and asymmetrical. It is possible that this distribution is bimodal.
We look at the next column.
plot_hists_stats_skewness(df_2013["Family"], stats_titles[5])
The count, minimum and maximum values are as we expect and have reasonable values. The distribution for female employment rate as family workers is another that has a distinct positive skew. This is confirmed by another very high skewness value.
We look at the final column.
plot_hists_stats_skewness(df_2013["Self_employed"], stats_titles[6])
The count, minimum and maximum values are as we expect and have reasonable values. The distribution for female employment as self-employed workers is symmetrical, but possibly bimodal. The skewness value indicates symmetry.
We now go on to address our research questions.
To find out whether the number of children per woman has any impact on the rate of female employment, we look at the correlation between fertility rate and total female employment rate.
sns.heatmap(df_2013[["Fertility", "Employment"]].corr(), annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between Fertility Rate & Female Employment");
There is a low positive correlation between the fertility rate and the total female employment rate.
We investigate further by splitting the data into two roughly equal groups: the half of the countries with a fertility rate equal to or higher than the median fertility rate, and the other half with a fertility rate lower than the median fertility rate. We will then assess these two groups separately to see whether there are any commonalities or relationships between them, in terms of female employment rate.
# Storing the median fertility rate
fertility_median = df_2013["Fertility"].median(axis = 0)
# Masks to use for slicing the dataframe into high fertility rate and low fertility rate groups
mask_high_fertility = df_2013["Fertility"] >= fertility_median
mask_low_fertility = df_2013["Fertility"] < fertility_median
# Using the masks, and storing the subsets in new variables
df_2013_high_fertility = df_2013[mask_high_fertility]
df_2013_low_fertility = df_2013[mask_low_fertility]
We first use a side-by-side line plot to visualize the female employment rate for the two groups. We define a custom function for this purpose.
def plot_side_by_side(df_high, df_low, title):
    '''
    This function takes in two dataframes, one for countries with higher fertility rate
    and the other for countries with low fertility rate, and then creates two line plots,
    one for each dataframe, and on each of two adjacent sets of axes
    '''
    plt.figure(figsize = (35, 35));
    # Creating a subplot on the left and plotting for the high fertility group
    ax1 = plt.subplot(1, 2, 1);
    plt.plot(df_high, color = "blue", label = "High Fertility");
    plt.xticks(rotation = 90);
    plt.legend(fontsize = "xx-large");
    plt.ylabel("Female Employment Rate (%)", size = 18);
    # Creating a subplot on the right and plotting for the low fertility group
    ax2 = plt.subplot(1, 2, 2, sharey = ax1);
    plt.plot(df_low, color = "magenta", label = "Low Fertility");
    plt.xticks(rotation = 90);
    plt.legend(fontsize = "xx-large");
    plt.subplots_adjust(wspace = 0);
    plt.suptitle("Comparison of Female Employment{} between Countries with High Fertility and Low Fertility".format(title), size = 30, y = 0.9);
# Using our custom function for the plots
plot_side_by_side(df_2013_high_fertility["Employment"], df_2013_low_fertility["Employment"], "")
From the graphs, we can see that there is variability in the female employment rate for each group. This could explain the low correlation we saw for the combined data above, if there isn't a similar level of variability in the fertility rate itself across countries. We also see that, whilst the two groups seem to range between similar values of employment rate, there seems to be more variability in the values for the countries with higher fertility than for those with lower fertility.
Next, we look at how the descriptive statistics for countries with higher fertility rate differ (or do not) from those of countries with lower fertility rate. We will also compute the coefficients of variation (CV), calculated as a ratio of the standard deviation to the mean, which allow us to make comparisons of the levels of variability for each of the indicators for each of our groups.
# Storing descriptive statistics in a variable, and then displaying them
high_employment_stats = df_2013_high_fertility[["Fertility", "Employment"]].describe()
display(high_employment_stats)
# Computing the CV values using rows from the descriptive statistics table, and printing them
high_employment_CV = high_employment_stats.loc["std"] / high_employment_stats.loc["mean"]
print("Coefficients of Variation:")
display(high_employment_CV)
# Storing descriptive statistics in a variable, and then displaying them
low_employment_stats = df_2013_low_fertility[["Fertility", "Employment"]].describe()
display(low_employment_stats)
# Computing the CV values using rows from the descriptive statistics table, and printing them
low_employment_CV = low_employment_stats.loc["std"] / low_employment_stats.loc["mean"]
print("Coefficients of Variation:")
display(low_employment_CV)
We find that the mean and median for the two groups are almost exactly the same. There is more variability, however, in the female employment rate of countries with higher fertility rate than those with lower fertility rate, evidenced by the higher standard deviation, and almost twice the CV, for that group. We also see from the min and max values of each group that the higher fertility group has a wider range of values for female employment rate than the lower fertility group.
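The median split and the coefficient of variation can also be computed in one pass with a grouping column. The following is a minimal sketch on made-up values, not the approach used above; the names `toy`, `fertility_group` and `summary` are hypothetical.

```python
import pandas as pd

# Made-up values for four countries
toy = pd.DataFrame(
    {"Fertility": [1.0, 2.0, 3.0, 4.0], "Employment": [40.0, 60.0, 20.0, 80.0]},
    index=["A", "B", "C", "D"],
)

# Label each country as "high" (at or above the median fertility) or "low"
median_fertility = toy["Fertility"].median()
fertility_group = (
    (toy["Fertility"] >= median_fertility)
    .map({True: "high", False: "low"})
    .rename("fertility_group")
)

# Mean, standard deviation and CV of employment within each group
summary = toy.groupby(fertility_group)["Employment"].agg(["mean", "std"])
summary["cv"] = summary["std"] / summary["mean"]
print(summary)
```

In this toy example both groups have the same mean employment rate but very different CV values, which mirrors the kind of pattern described above.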
Next, we check for the correlation with fertility rate for each group separately.
sns.heatmap(df_2013_high_fertility[["Fertility", "Employment"]].corr(), annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between Higher Fertility Rate & Female Employment");
sns.heatmap(df_2013_low_fertility[["Fertility", "Employment"]].corr(), annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between Lower Fertility Rate & Female Employment");
We find that countries with higher fertility rate have a higher positive correlation with female employment than countries with lower fertility rate. However, the correlation is still only moderate.
To investigate the relationship between fertility rate and female employment in different sectors, we look at the correlation matrix heatmap for fertility against each of the three indicators.
sns.heatmap(df_2013[["Fertility", "Agriculture", "Industry", "Service"]].corr()[["Fertility"]], annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between 2013 Fertility Rate and Female Employment by Sector");
Female employment in the industry sector has the weakest correlation to fertility rate, and it is a positive correlation. Female employment rates in the agriculture and service sectors have correlations of almost equal magnitude, but for the agriculture sector it is positive, whereas for the service sector it is negative.
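The same correlations can be pulled out as a sorted column of numbers, which makes the signs and magnitudes easier to compare than colours on a heatmap. The sketch below uses made-up values constructed so that one indicator rises with fertility and the other falls; the dataframe `toy` is hypothetical.

```python
import pandas as pd

# Made-up values: agriculture rises with fertility, service falls with it
toy = pd.DataFrame({
    "Fertility": [1.0, 2.0, 3.0, 4.0],
    "Agriculture": [10.0, 20.0, 30.0, 40.0],
    "Service": [80.0, 60.0, 40.0, 20.0],
})

# Correlation of every indicator with fertility, most negative first
corr_with_fertility = toy.corr()["Fertility"].drop("Fertility").sort_values()
print(corr_with_fertility)
```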
We look at the line plots for the two groups of countries to compare female employment in different sectors.
# Using our custom function to create the plots
plot_side_by_side(df_2013_high_fertility["Agriculture"], df_2013_low_fertility["Agriculture"], " in Agriculture Sector")
Looking at the graphs for the agriculture sector, we see that countries with a lower fertility rate tend to have lower female employment in agriculture than countries with a higher fertility rate. We would expect the group with lower fertility to have a lower average employment rate than the other group. We also find that the countries with higher fertility rate seem to have higher variability in the female employment rate across countries, and have a wider range of values.
plot_side_by_side(df_2013_high_fertility["Industry"], df_2013_low_fertility["Industry"], " in Industry Sector")
For the industry sector, there is a larger number of visibly high outliers in the female employment rate for countries with a higher fertility rate than the countries with a lower fertility rate. There also seems to be more variability in the higher fertility rate group. However, based on visual inspection, the average employment rate seems to be similar for both groups.
plot_side_by_side(df_2013_high_fertility["Service"], df_2013_low_fertility["Service"], " in Service Sector")
For the service sector, the average female employment rate seems to be higher for the lower fertility rate group than for the higher fertility rate group of countries. The higher fertility rate group, however, has significantly higher variability in the employment rate, and also shows a wider range of values.
We investigate the validity of our observations by looking at descriptive statistics for each group.
# Storing descriptive statistics in a variable, and then displaying them
high_sector_stats = df_2013_high_fertility[["Fertility", "Agriculture", "Industry", "Service"]].describe()
display(high_sector_stats)
# Computing the CV values using rows from the descriptive statistics table, and printing them
high_sector_CV = high_sector_stats.loc["std"] / high_sector_stats.loc["mean"]
print("Coefficients of Variation:")
display(high_sector_CV)
# Storing descriptive statistics in a variable, and then displaying them
low_sector_stats = df_2013_low_fertility[["Fertility", "Agriculture", "Industry", "Service"]].describe()
display(low_sector_stats)
# Computing the CV values using rows from the descriptive statistics table, and printing them
low_sector_CV = low_sector_stats.loc["std"] / low_sector_stats.loc["mean"]
print("Coefficients of Variation:")
display(low_sector_CV)
For the agriculture sector, the vast differences in the median and mean values for each group confirm that there is lower female employment in the sector in countries with lower fertility rate. However, contrary to our prior observation, the statistics show that countries with lower fertility rate have a CV value over twice that of the higher fertility rate group of countries.
For the industry sector, the mean and median values for each group are similar, with the lower fertility rate group having slightly higher values. There is more variability for countries in the higher fertility rate group than in the lower fertility group. The max value is much higher for the countries in the higher fertility group, supporting our observation about outliers having higher values in that group.
For female employment rate in the service sector, the higher fertility rate group has lower mean and median employment rates. This group also has higher variability, and a wider range of values.
We will now look at the correlation between female employment in the different sectors and fertility rate, for each group separately.
sns.heatmap(df_2013_high_fertility[["Fertility", "Agriculture", "Industry", "Service"]].corr()[["Fertility"]], annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between 2013 Higher Fertility Rate and Female Employment by Sector");
sns.heatmap(df_2013_low_fertility[["Fertility", "Agriculture", "Industry", "Service"]].corr()[["Fertility"]], annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between 2013 Lower Fertility Rate and Female Employment by Sector");
We find that for countries with higher fertility rates, there is a much stronger correlation for two sectors: a strong positive correlation for the agriculture sector and a strong negative correlation for the service sector. The countries with lower fertility rates have a weaker correlation in both cases. However, for the industry sector, there is a weak negative correlation between fertility rate and female employment rate that is almost equal for countries in both groups.
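As a complement to the two heatmaps, the correlations with fertility rate for both groups can be placed side by side in a single table. This is a minimal sketch using made-up values. The helper `fertility_correlations` is a hypothetical name, and the toy frames stand in for `df_2013_high_fertility` and `df_2013_low_fertility`.

```python
import pandas as pd


def fertility_correlations(df, target="Fertility"):
    """Pearson correlation of every other column with the target column."""
    return df.corr()[target].drop(target)


# Made-up illustrative values, NOT the Gapminder data
toy_high = pd.DataFrame({
    "Fertility": [4.0, 5.0, 6.0, 7.0],
    "Agriculture": [40.0, 50.0, 60.0, 70.0],
    "Service": [50.0, 40.0, 30.0, 20.0],
})
toy_low = pd.DataFrame({
    "Fertility": [1.2, 1.5, 1.8, 2.1],
    "Agriculture": [5.0, 3.0, 6.0, 4.0],
    "Service": [80.0, 85.0, 78.0, 82.0],
})

# One column per group, one row per sector
comparison = pd.DataFrame({
    "Higher fertility": fertility_correlations(toy_high),
    "Lower fertility": fertility_correlations(toy_low),
})
print(comparison.round(2))
```

A table like this makes it straightforward to compare the sign and magnitude of each correlation across groups without reading two separate colour scales.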
We now turn to the relationship between fertility rate and the female employment rate for two employment statuses, family workers and self-employed workers. We'll first look at the correlation between fertility rate and each indicator.
sns.heatmap(df_2013[["Fertility", "Family", "Self_employed"]].corr()[["Fertility"]], annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between 2013 Fertility Rate and Female Employment by Status");
We see that fertility rate has a positive correlation with female employment rate for both family workers and self-employed workers. For the family workers, the correlation is moderate, while for self-employed workers, the correlation is much stronger.
We look at the line plots for the two groups of countries, the group with higher fertility and the group with lower fertility rates, to compare female employment rates for different employment statuses.
# Using our custom function to create the line plots
plot_side_by_side(df_2013_high_fertility["Family"], df_2013_low_fertility["Family"], " Rate for Family Workers")
For family workers, the graphs show that the average female employment rate is lower for countries with lower fertility rates than for countries with higher fertility rates. There seems to be greater variability in the values for the lower fertility rate group as well.
plot_side_by_side(df_2013_high_fertility["Self_employed"], df_2013_low_fertility["Self_employed"], " Rate for Self-employed Workers")
For self-employed workers, there is again a lower average employment rate for countries with lower fertility rates, although variability seems to be higher for the higher fertility rate group.
We now look at the descriptive statistics for each group to see whether our observations are valid.
# Storing descriptive statistics in a variable, and then displaying them
high_status_stats = df_2013_high_fertility[["Fertility", "Family", "Self_employed"]].describe()
display(high_status_stats)
# Computing the CV values using rows from the descriptive statistics table, and printing them
high_status_CV = high_status_stats.loc["std"] / high_status_stats.loc["mean"]
print("Coefficients of Variation:")
display(high_status_CV)
# Storing descriptive statistics in a variable, and then displaying them
low_status_stats = df_2013_low_fertility[["Fertility", "Family", "Self_employed"]].describe()
display(low_status_stats)
# Computing the CV values using rows from the descriptive statistics table, and printing them
low_status_CV = low_status_stats.loc["std"] / low_status_stats.loc["mean"]
print("Coefficients of Variation:")
display(low_status_CV)
The median and mean values for female employment rate for family workers are higher for the higher fertility rate group than for countries with lower fertility. And while both groups have high CV values, variability is much higher for the lower fertility rate group than for the higher fertility group.
For self-employed workers, the lower fertility rate group shows lower median and mean values for female employment rate. This group shows a higher value for CV, though, than the higher fertility rate group.
We now look at the correlation between fertility rate and the employment rate for each employment status, for the two groups of countries separately.
sns.heatmap(df_2013_high_fertility[["Fertility", "Family", "Self_employed"]].corr()[["Fertility"]], annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between 2013 Higher Fertility Rate and Female Employment by Status");
For the countries with higher fertility rates, there is a strong positive correlation with the employment rate for self-employed workers, and a moderate positive correlation for family workers.
sns.heatmap(df_2013_low_fertility[["Fertility", "Family", "Self_employed"]].corr()[["Fertility"]], annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between 2013 Lower Fertility Rate and Female Employment by Status");
For countries with lower fertility rates, the correlation between fertility rate and female employment rate for employment statuses is almost equal, and is weak to moderate for each employment status. In both cases, the correlation is positive.
We will now make comparisons between the relationship that fertility rate has to female employment rate for different employment sectors, versus the relationship it has to female employment rate for different employment statuses. We first look at a correlation heatmap for the fertility rate and our indicators for all countries combined.
sns.heatmap(df_2013.drop("Employment", axis = 1).corr()[["Fertility"]], annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between 2013 Fertility Rate and Female Employment by Sector & Status");
We find that the fertility rate has the strongest correlation to the employment rate for self-employed workers, and it is a strong positive correlation. The two next strongest correlations are a positive correlation to the employment rate in the agriculture sector and a negative correlation to that of the service sector. The absolute value of the lowest correlation to an employment status, family workers, is higher than that of the lowest correlation to an employment sector, the industry sector.
We now look at the same comparisons, but this time for the two groups of countries separately.
sns.heatmap(df_2013_high_fertility.drop("Employment", axis = 1).corr()[["Fertility"]], annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between 2013 Higher Fertility Rate and Female Employment by Sector & Status");
The correlation values for the countries with higher fertility rate follow the same pattern as the correlation values for the combined data. The values are also very similar.
sns.heatmap(df_2013_low_fertility.drop("Employment", axis = 1).corr()[["Fertility"]], annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between 2013 Lower Fertility Rate and Female Employment by Sector & Status");
For the countries with lower fertility rates, the correlations follow the same ranking of employment sectors and statuses from highest to lowest, but the values are significantly lower, and are over a much smaller range. For the countries with lower fertility rates, there is stronger correlation with the employment statuses than with the employment sectors, although none of the values represent particularly strong correlations in either direction.
We now look at the relationships between the female employment rates for the different sectors and the female employment rates for different employment statuses. We start with a correlation matrix heatmap.
sns.heatmap(df_2013.drop(["Fertility", "Employment"], axis = 1).corr(), annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between Female Employment Rate by Sector & Status for 2013");
With data for all countries for all levels of fertility rates, we find that the female employment rate in the agriculture sector has high positive correlation to the female employment rate for both family workers and self-employed workers. The correlations of the employment rate in the service sector have the same values as those of the agriculture sector, but for the service sector it is a strong negative correlation with both employment statuses. This makes sense, given that the employment rate in the agriculture sector has a very strong correlation of almost -1 to the employment rate in the service sector. The employment rate in the industry sector has a weak negative correlation to both the employment rate for family workers and for self-employed workers.
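The mirrored signs described above can be illustrated with a small simulation. This is only a sketch, under the assumption that the three sector rates are shares of total female employment that sum to roughly 100 and that the industry share varies little. All values are randomly generated and are not the Gapminder data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50

# Synthetic shares: agriculture varies widely, industry only a little,
# and service takes up the remainder (assumption: the shares sum to 100)
agriculture = rng.uniform(5.0, 80.0, size=n)
industry = rng.uniform(5.0, 15.0, size=n)
service = 100.0 - agriculture - industry

# A synthetic status indicator that rises with the agriculture share
family = 0.4 * agriculture + rng.normal(0.0, 5.0, size=n)

toy = pd.DataFrame({"Agriculture": agriculture, "Industry": industry,
                    "Service": service, "Family": family})
corr = toy.corr()

agri_service = corr.loc["Agriculture", "Service"]
family_agri = corr.loc["Family", "Agriculture"]
family_service = corr.loc["Family", "Service"]
print(f"Agriculture vs Service: {agri_service:.2f}")
print(f"Family vs Agriculture:  {family_agri:.2f}")
print(f"Family vs Service:      {family_service:.2f}")
```

Under these assumptions, the service share is close to a mirror image of the agriculture share, so any indicator correlated with one is correlated with the other to almost the same degree but with the opposite sign.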
We now look at the correlation in female employment rate between employment statuses and employment sectors for the two groups of countries separately.
sns.heatmap(df_2013_high_fertility.drop(["Fertility", "Employment"], axis = 1).corr(), annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between Female Employment Rate by Sector & Status for 2013, for Higher Fertility Rates");
For the countries with higher fertility rate, there is a moderate positive correlation between the employment rate in the agriculture sector and the employment rate for family workers, and there is a high positive correlation with the employment rate for self-employed workers. These values are again mirrored in the correlation of the employment rate in the service sector to the same indicators, but the relationship is a negative correlation instead. For the employment rate in the industry sector, there is a weak negative correlation to the employment rates for both family workers and self-employed workers.
sns.heatmap(df_2013_low_fertility.drop(["Fertility", "Employment"], axis = 1).corr(), annot = True, cmap = "PiYG");
plt.title("Correlation Matrix Heatmap between Female Employment Rate by Sector & Status for 2013, for Lower Fertility Rates");
For the countries with lower fertility rate, we find that there is a very strong positive correlation of almost 1 between the female employment rate in the agriculture sector and the female employment rate for each of the employment statuses. There is a similarly very strong negative correlation between the employment rate in the service sector and the employment rates for both employment statuses. The employment rate in the industry sector has a weak positive correlation to the employment rate for each of the employment statuses.
For the first question, we found that there is a low positive correlation between fertility rate and the female employment rate in the combined data. When we categorized our data by fertility rate, we also found that there is a slightly higher positive correlation for the countries with higher fertility rates than for those with lower fertility rates, but both correlations were positive. This weakly suggests that female employment rates rise with fertility rates, and that this association is somewhat stronger among countries with higher fertility rates. The descriptive statistics showed us that the mean and median values of employment rates were similar between the two groups, being slightly higher for the higher fertility rate group, but there was more variability in the countries with higher fertility. We can therefore conclude that countries with higher fertility rate show a slightly higher total female employment rate. However, we cannot conclude that higher fertility rates lead to higher female employment rates, as many more factors likely contribute, and our correlations are not enough to base that conclusion on.
For the second question, we found that, in the combined data, fertility rate was most strongly correlated with the employment rate in the agriculture sector (positively) and in the service sector (negatively). The correlation was weak with the industry sector. Countries in the category with lower fertility rate had weak correlations to all sectors, while for the other group the correlations were again stronger to equal degrees with the agriculture and service sectors, also in opposite directions. This indicates that countries with higher fertility rates have higher female employment rates in agriculture and lower female employment rates in service. It is possible that patterns in fertility rates lead to this, if the conditions or circumstances of employment in the service sector are not amenable to parenting for example, but this is not a conclusion we can base on this study alone and would require further investigation with more data on other factors. Given the stark difference in correlations for the two groups, there could also be other differences between the countries in our two categories that we are only seeing reflected in the fertility and employment rate data, but that are not necessarily caused by either.
For the third question, we found that there is a high positive correlation between fertility rate and the employment rate as self-employed workers, and a moderate correlation for family workers. This was for the combined fertility rate data. For the lower fertility rate group, the correlations were also positive, but were almost equal for both employment statuses. For the higher fertility rate group, the correlations mirrored those of the combined data. This could indicate that, for countries with higher fertility rates, it is more likely for female workers to be in self-employment. We were not able to obtain data for female employment rate as salaried workers, and it would have been helpful to also have this additional comparison to fertility rates. However, for the data we have, there is a higher correlation with self-employed workers. We cannot decisively state that this is because of direct impact of fertility rate, as we only used descriptive statistics and correlations, but it is a distinct possibility. More data and additional techniques, such as statistical modelling and inferential statistics, would be required for more definitive conclusions.
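As one example of the kind of inferential check mentioned above, a permutation test can estimate how often a correlation at least as strong as the observed one would arise by chance. This is a minimal sketch and was not part of the analysis in this report. The function name `permutation_pvalue` is hypothetical, and the data are randomly generated stand-ins for the `Fertility` and `Self_employed` columns.

```python
import numpy as np


def permutation_pvalue(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real association with x
        shuffled = rng.permutation(y)
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= abs(observed):
            count += 1
    # Add-one correction keeps the estimate away from exactly zero
    return observed, (count + 1) / (n_permutations + 1)


# Synthetic values, NOT the Gapminder data
rng = np.random.default_rng(1)
toy_fertility = rng.uniform(1.0, 7.0, size=40)
toy_self_employed = 10.0 * toy_fertility + rng.normal(0.0, 10.0, size=40)

r, p = permutation_pvalue(toy_fertility, toy_self_employed)
print(f"r = {r:.2f}, permutation p-value = {p:.4f}")
```

A small p-value would only indicate that the association is unlikely to be due to chance. It would not, by itself, support a causal interpretation.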
For the fourth question, we found that the three strongest correlations between employment rate and the combined fertility rate were with the self-employed worker status, the agriculture sector, and the service sector. The first two were positive correlations and the third was a negative correlation. For both groups of countries with higher and lower fertility rates, we found that the same pattern held, but overall the lower fertility rate group had much weaker correlations than the other group. Thus, we found that there were more strong correlations with the employment sectors than with the employment statuses.
This might mean that the fertility rate of a country affects the sector in which female workers get employed more than it affects their employment status. Or, combining these observations, it might also mean that fertility rate impacts self-employment work, but impacts it more strongly in the agriculture and service sectors than in the industry sector. However, there are likely many other factors that are beyond the scope of those studied in this project that have a causal relationship with this observation, beyond the correlations we have here. This can also be noted in the difference in magnitude of correlation between countries with higher fertility rates and lower fertility rates. It is possible that external factors that affect the fertility rate for these two groups also affect opportunities and/or perceptions regarding female employment in general, female employment in different sectors, and female employment statuses. These could include social, political, cultural or economic factors that lead to the differences in the fertility rates, and our categorization here may be representative of that, rather than of the fertility rate itself having this impact. It would be valuable to conduct an investigation similar to the one in this project, using data where countries are categorized by similarity in these factors.
We also do not have enough information to make concrete conclusions about whether these trends are based on patterns in preferences of the workers, or on patterns in the availability of employment options to the workers. This could provide another avenue to explore in a future investigation. It is also prudent to note that a more complete investigation of this question would be possible with employment rate data for salaried workers, which would lead to additional insights, and possibly amendments to the insights we have gained in this version of the investigation.
For the final question, we find that for both the combined and categorized data, employment rates in the agriculture and service sectors have a strong correlation with the employment rates for both self-employed and family workers. The correlations with the agriculture sector are positive, and they are negative with the service sector. There is also a very strong negative correlation between the agriculture sector and the service sector. Taken together, this indicates that countries with higher employment rates in agriculture have higher employment rates of family workers and self-employed workers and lower employment rates in the service sector. This suggests that much of the work in agriculture involves self-employment or family work, while much of the service work does not, and this intuitively makes sense.
Additionally, given the finding in the second question above that fertility rate is positively correlated with the employment rate in the agriculture sector and negatively correlated with that in the service sector, it also follows that countries with higher fertility rates see more female employment as self-employed or family workers in the agriculture sector. It is intuitively more likely that those with and/or from larger families are involved in family work especially, as well as self-employed work. It is, again, not evident from our study alone whether this is due to preferences, or based on availability of opportunity, or some combination of both in the form of convenience for the female workers, as there are likely many factors that could contribute. This also fits the definitions given by the ILO and the OECD regarding the scope and characteristics that classify one as a service worker, or as a family worker, contributing or unpaid. Based on these definitions, it would be very unlikely for a worker in the service sector to be self-employed, and a family worker is almost inherently self-employed.
There is a very weak correlation with the industry sector for both employment statuses, for both combined and categorized fertility rate data. It would again have been very helpful here to investigate this sector alongside data for the employment rate for salaried workers, to see how the relationship varies there.
A prominent limitation in this project was the unavailability of data for the female employment rate for salaried workers. Including this employment status in our investigation would have proven especially useful in addressing our last two research questions. For question 4, we would have been able to have a more balanced comparison between the employment rates for the employment sectors and the employment rates for employment statuses. For question 5, we could have gained much more insight into additional relationships. For example, for the correlation between the employment rate in the industry sector and the employment rates for the different employment statuses, it would have been helpful to look at the correlation to the employment rate for salaried workers and how it differed from the other values. Right now, in the absence of this data, we can only surmise that there could have been a stronger correlation between the industry sector and that employment status since, intuitively, workers in the industry sector are probably more likely to be salaried workers than those in the agriculture and service sectors. But this is a mere supposition at best, and no conclusive positions can be reached without the data for female employment rates for salaried workers.
Another limitation is that the employment rate data for the countries does not account for differences in the size of each employment sector across countries. It would be valuable to categorize the countries by the sector in which the country has its highest economic activity, and make comparisons within and across groups. For example, countries in which agriculture is not a major economic activity would likely have much lower employment rates in that sector than countries in which it is, which could potentially distort the data and affect the conclusions we can make from that data. Grouping countries with similar patterns of economic activity, for instance countries that have industry as a major economic driver, would give comparisons made in this context much more meaning, and would lead to better quality analysis for both the agriculture and the industry sectors.
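The grouping suggested above could be sketched as follows, assuming a dataset of each sector's share of economic activity were available. No such dataset was used in this project, so the `gdp_*` column names and all values below are hypothetical and made up purely for illustration.

```python
import pandas as pd

# Made-up values: hypothetical sector shares of economic activity, plus the
# female employment rate in agriculture, for four fictional countries
toy = pd.DataFrame(
    {
        "gdp_agriculture": [40.0, 36.0, 5.0, 3.0],
        "gdp_industry": [25.0, 30.0, 50.0, 30.0],
        "gdp_service": [35.0, 34.0, 45.0, 67.0],
        "Agriculture": [70.0, 65.0, 10.0, 2.0],
    },
    index=["Country A", "Country B", "Country C", "Country D"],
)

share_columns = ["gdp_agriculture", "gdp_industry", "gdp_service"]

# Label each country with the sector that has the largest share
toy["Dominant_sector"] = (
    toy[share_columns].idxmax(axis=1).str.replace("gdp_", "", regex=False)
)

# Compare female agricultural employment within each group
group_means = toy.groupby("Dominant_sector")["Agriculture"].mean()
print(group_means)
```

The same grouping column could then be used in place of the fertility rate categories to repeat the correlation analysis within each group.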